MOFSocialNet: Exploiting Metal-Organic Framework Relationships via Social Network Analysis

The number of metal-organic frameworks (MOFs) and the number of applications of this class of materials are growing rapidly. With the number of characterized compounds exceeding 100,000, manual sorting becomes impossible. At the same time, increasing computing power and the established use of automated machine learning approaches make data science tools available that provide an overview of the MOF chemical space and support the selection of suitable MOFs for a desired application. Among the different data science tools, graph theory approaches, in which data generated from numerous real-world applications are represented as a graph (network) of interconnected objects, have been widely used in a variety of scientific fields such as social sciences, health informatics, biological sciences, agricultural sciences and economics. We describe the application of a particular graph theory approach known as social network analysis to MOF materials and highlight the importance of community (group) detection and graph node centrality. In this first application of the social network analysis approach to MOF chemical space, we created MOFSocialNet. This social network is based on the geometrical descriptors of MOFs available in the CoRE-MOFs database. MOFSocialNet can discover communities of MOFs with similar structures and identify the most representative MOFs within a given community. In addition, analysis of MOFSocialNet using social network analysis methods can predict MOF properties more accurately than conventional ML tools. The latter advantage is demonstrated for the prediction of gas storage properties, the most important property of these porous reticular networks.


Introduction
The modelling and examination of complex systems that contain chemical, biological, ecological, economic, social, technological, and other types of information is a very challenging process when the number of elements in the system becomes very large. Metal-organic frameworks (MOFs), a class of chemical compounds composed of metal nodes connected via organic linker molecules, represent a particularly complex example from materials science. The wide variety of metal nodes and organic linker molecules suitable for MOF synthesis and the virtually unlimited number of linker/node combinations lead to an enormous MOF chemical space. While the number of experimentally characterized MOFs already exceeds 100,000 [1], there is no upper limit for the total number of these reticular networks. This diversity in MOF chemistry makes it extremely difficult to navigate through the large design space, to identify MOF materials with suitable properties for a desired application, or to identify the most representative MOFs for an anticipated study, i.e., those that best cover the available design space. Identifying an appropriate MOF for a given application in this huge chemical space can, in principle, be carried out by high-throughput screening of databases of existing or hypothetical MOFs. This screening can be carried out either experimentally or theoretically [2]. However, this approach becomes extremely costly with increasing size of the database. Recently, new paradigms for the discovery and rational design of materials have been established that are based on machine learning (ML) as well as sophisticated data science analysis methods and algorithms [3]. In the field of MOFs, first examples in which ML methods were employed to predict material properties, or even to design new MOF structures and predict synthesis conditions, have already been successfully demonstrated [4].
The simulation of the adsorption capacity for gases, probably the most important property of MOFs for existing applications, provides an interesting example of ML-based strategies for handling large materials classes. The presently most popular ML algorithms for predicting the adsorption of guest molecules by MOFs are Support Vector Machine, Random Forest and Neural Networks [5]. In addition, deep learning, a particularly effective ML approach, has been used in a number of different applications [6]. For many practical applications, the stability of MOFs in aqueous environments is an important prerequisite, and ML-based models have been shown to predict the water stability of MOFs accurately [7].
When applying ML models, a first important step is the selection of descriptors. The number of different parameters describing MOF structures is simply too large for a direct, straightforward analysis. Therefore, ML-based workflows were also introduced that allow the most valuable descriptors within a given family of MOFs to be extracted [8].
Combining data mining and machine learning has allowed for the prediction of MOF synthesis [9] and MOF stability [10]. In another study, the authors proposed a machine-learning algorithm to predict the possibility of metal-linker combinations for the guest accessibility of MOFs. In this method, various ML models were evaluated to learn the connection between component chemistry and MOF properties without explicitly requiring a priori knowledge of the MOF structure [11].
Building a social network enables the use of machine learning techniques based on graph mining to extract valuable knowledge from MOF data. Our goal in this study is to demonstrate that "social networks" constructed from MOFs are a valuable tool for analysing large MOF databases, making it possible to navigate through the MOF chemical space, to identify suitable MOFs for a given application or desired study, and to curate large datasets efficiently.
We used social network analysis (SNA), rather than more traditional machine learning algorithms, since SNA outperforms other ML models in visualizing complex relationships between different MOFs. Additionally, SNA makes it possible to extract information about the properties of a given (e.g., so far unknown) MOF from its relationship to "neighbouring" (known) MOFs in the social graph. SNA can therefore be used to extract useful information, e.g., to find the most representative MOFs or to identify implicit and hidden dependencies between MOFs.
Two primary types of SNA were performed as part of the current research: centrality analysis and community detection. In the centrality analysis, parameters are determined that characterize a given MOF node in relation to the other MOF nodes in the graph. In MOFSocialNet, we deal with two types of centrality: degree centrality and closeness centrality. Degree centrality focuses on the links of one MOF node to other MOF nodes. MOF nodes with a high degree centrality can be regarded as important MOF structures with similar characteristics to many other MOF structures in the dataset. The closeness centrality is computed by considering the average distance from the target MOF node to the other MOF nodes in the network. MOF nodes with a high closeness centrality can be regarded as very representative MOF structures for the given set of analyzed MOF structures [27]. Node centralities make it possible to identify the most important or influential nodes in a graph. For instance, a high value of degree centrality identifies nodes that are in the middle of the network. Thus, blending the information provided by the different centralities makes it possible to analyze the MOF networks and to find correlations between MOFs.
Another analysis that we performed in MOFSocialNet is community detection [28][29][30][31][32][33][34]. Community detection is essentially a type of clustering problem: it aims to group the nodes according to the relations between them, so that strongly related sub-graphs are formed from the entire graph. For example, the detection of communities holds an important place in the analysis and functional prediction of the interaction networks between proteins and other molecules in biological cells [21] and in the prediction and identification of disease genes.
In the context of MOFSocialNet, community detection can be applied to provide an overview and a structure to highly diverse MOF datasets. Social network analysis can therefore help to rationalize the categories used to describe MOF types, moving away from "most popular MOF types" towards categories based on more objective features or properties of the MOFs.

Methods
In this paper, we construct a social network called MOFSocialNet from geometrical descriptors of MOFs in the CoRE-MOFs database. MOFSocialNet is an undirected, weighted, and heterogeneous social network. Following the construction of MOFSocialNet, we provide a set of social network analytic processes to extract valuable knowledge from the MOF data using graph-mining algorithms. The full workflow of MOFSocialNet is shown in Figure 1. In the first step, we created a feature vector for each MOF. In the following step, the similarity between each pair of MOF vectors was calculated using vector similarity methods. In this step we created the MOF social graph, named MOFSocialNet, in which MOFs are the nodes and the similarities between the MOF feature vectors are the links (i.e., the relationships between MOFs). Finally, after removing irrelevant links in the graph, we applied social network analysis methods to extract valuable knowledge from this graph. In the subsequent sections, all steps are described in detail.

Creating MOF Feature Vectors
A feature vector is an n-dimensional vector of numerical features that can, e.g., describe an object in pattern recognition using machine learning. In the case of MOFs, every property depends on a set of specific descriptors, i.e., geometrical, chemical, topological, and energy-based descriptors [35]. Before applying the data analysis process, it is critical to identify key descriptors that are highly correlated with the property of interest of the MOF.
For a demonstration of the proposed approach, we limited ourselves to a subset of 1000 MOFs in the CoRE-MOFs database and used eight geometric descriptors for each MOF (see Table 1). Therefore, each MOF is assigned an eight-dimensional feature vector.

In order to simplify the further analysis, we normalized the values of the numerical columns in the dataset to a common scale. The most common method of data normalization is Min-Max normalization, in which values are transformed into decimals between 0 and 1 using formula 1:
$$v' = \frac{v - \min_A}{\max_A - \min_A} \tag{1}$$
where min_A and max_A denote the minimum and maximum values of the corresponding property A. In the following, the original and normalized values are denoted by v and v', respectively. As a consequence, all v' adopt values from the interval [0, 1] [36].
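The normalization step can be illustrated with a minimal Python sketch. The function name `min_max_normalize`, the sample values, and the handling of a constant column are assumptions made for illustration; they are not taken from the original workflow.

```python
def min_max_normalize(column):
    """Scale a list of numbers to [0, 1] via v' = (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: the formula is undefined here.
        # Mapping every value to 0.0 is an assumption of this sketch.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical values of one geometric descriptor (not data from the paper).
descriptor_values = [0.5, 1.0, 1.5, 2.5]
normalized = min_max_normalize(descriptor_values)
print(normalized)
```

Each descriptor column would be normalized separately, so that all eight descriptors contribute on a comparable scale to the similarity computed in the next step.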
For the construction of the graph, a metric must be introduced to measure the distance between two vertices. In previous works, either direct metrics (Euclidean or Manhattan) or more involved measures such as Pearson's product-moment correlation coefficient (PPMCC) or the cosine similarity have been used.
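In the current study, we used the cosine metric. Given two vectors of MOF descriptors, A and B, the cosine similarity, cos(θ), is computed as
$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{k} A_k B_k}{\sqrt{\sum_{k} A_k^2}\,\sqrt{\sum_{k} B_k^2}} \tag{2}$$
where A_k and B_k are the components of the two vectors and the sums run over all descriptors. In this initial graph, the number of edges is too large for an efficient analysis, and an effective strategy to remove weak links must be introduced. Therefore, we removed all edges with a similarity below a threshold parameter d.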
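The graph construction described here (pairwise cosine similarity followed by removal of weak links) might be sketched as follows. The helper names `cosine_similarity` and `build_edges`, the two-dimensional toy vectors, the MOF labels, and the threshold of 0.9 are hypothetical and chosen only to keep the example short; the study itself uses eight descriptors per MOF.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Similarity to a zero vector is undefined; 0.0 is an assumption.
        return 0.0
    return dot / (norm_a * norm_b)


def build_edges(vectors, threshold):
    """Return {(name_i, name_j): similarity} for all pairs at or above threshold."""
    edges = {}
    for (name_i, vec_i), (name_j, vec_j) in combinations(vectors.items(), 2):
        similarity = cosine_similarity(vec_i, vec_j)
        if similarity >= threshold:  # weak links are dropped
            edges[(name_i, name_j)] = similarity
    return edges


# Hypothetical normalized feature vectors (not data from the paper).
vectors = {
    "MOF-A": [1.0, 0.0],
    "MOF-B": [2.0, 0.0],
    "MOF-C": [0.0, 1.0],
}
edges = build_edges(vectors, threshold=0.9)
print(edges)
```

Because every pair of MOFs is compared, the cost grows quadratically with the number of MOFs, which is one reason why pruning weak links matters for the later analysis.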
The representation of the sample network of MOFSocialNet after elimination of weak links using a value of d = 0.9999 (reducing the number of nodes to 2214) is shown in Figure 2. The presence of an edge between two MOFs thus indicates that the similarity is above the threshold value. An important property of a particular node is its degree, the number of edges linking it to other nodes. The probability distribution of these degrees over the whole network is shown in Figure 3. The average degree based on the diagram is 46.

Centrality Measures in MOFSocialNet
For the next step of the analysis, we determined the centrality of nodes. Centrality means the relative significance of nodes (or vertices) and links (or edges). In our context, centrality measures how similar a MOF is to other MOFs within MOFSocialNet.
The simplest approach is degree centrality, which for a given node is obtained by counting the number of links connecting it to other nodes. In MOFSocialNet, a MOF with a high degree centrality has very similar properties to many other MOFs. For the MOFSocialNet graph (G), degree centrality (C_d) is defined as:
$$C_d(i) = \frac{k_i}{n - 1} \tag{3}$$
where k_i is the degree (number of edges connected to a node) of node i and n is the total number of nodes in the graph [35].
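As an illustration, degree centrality can be computed directly from an edge list. The sketch below assumes the common normalization C_d(i) = k_i / (n - 1); the function name `degree_centrality` and the toy graph are hypothetical.

```python
def degree_centrality(nodes, edges):
    """Normalized degree centrality k_i / (n - 1) for an undirected graph.

    nodes: list of node names; edges: iterable of (u, v) pairs, each listed once.
    """
    n = len(nodes)
    if n < 2:
        raise ValueError("degree centrality needs at least two nodes")
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {node: k / (n - 1) for node, k in degree.items()}


# Hypothetical toy graph: node "A" is linked to every other node.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
centrality = degree_centrality(nodes, edges)
top_node = max(centrality, key=centrality.get)
print(top_node, centrality[top_node])
```

In the toy graph, "A" is linked to all other nodes and therefore reaches the maximum value, which mirrors the reading of a high-degree MOF as one that resembles many other MOFs.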

Table 2 presents the 10 MOFs with the highest degree, and Figure 4 illustrates MOFSocialNet with these MOFs highlighted. The closeness centrality (C_c) of a vertex is a measure of the closeness of the vertex to the rest of the vertices in a graph. The C_c of a vertex is computed as the inverse of the sum of the hop counts (farness) of the shortest paths from the vertex to the rest of the vertices in the graph. If d(i, j) is the geodesic distance between two vertices v_i and v_j in a graph, then the closeness centrality of a vertex v_i can be computed as the inverse of the sum of the geodesic distances to the vertices v_j that are in the same component as v_i (a component is the largest set of vertices that are reachable from each other) [37].
Closeness centrality (C_c) is defined as:
$$C_c(i) = \frac{1}{\sum_{j} d(i, j)} \tag{4}$$
where i is the starting node, j the target node, and d(i, j) is the distance between them. This measures the distance from the starting node to other nodes in the graph [38]. The closeness centrality captures the accessibility of network components. In MOFSocialNet, being a network of MOF structures, closeness centrality can identify the most representative MOFs of a given dataset. Table 3 and Figure 5 illustrate the network, highlighting the ten MOFs with the highest closeness centrality.
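A minimal sketch of closeness centrality, computed as the inverse of the summed hop counts to all nodes in the same component, is given below. It uses a breadth-first search on an unweighted adjacency list; the function name, the toy path graph, and the value 0.0 for isolated nodes are assumptions of this example.

```python
from collections import deque


def closeness_centrality(adjacency, start):
    """Inverse of the summed hop counts from start to every reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    farness = sum(distance.values())
    # An isolated node has farness 0; returning 0.0 is an assumption.
    return 1.0 / farness if farness > 0 else 0.0


# Hypothetical path graph A - B - C: the middle node is closest to all others.
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
closeness = {node: closeness_centrality(adjacency, node) for node in adjacency}
most_central = max(closeness, key=closeness.get)
print(most_central, closeness[most_central])
```

Here the middle node "B" has the smallest farness and hence the highest closeness, which corresponds to the notion of a representative MOF sitting close to all others in its component.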

Community Detection in MOFSocialNet
At the most abstract level, given a social network G = (V, E), a community can be defined as a subgraph of the network including a set VC ⊆ V of social network entities that are associated with a common element of interest. This element can be a topic, a person of the real world, a place, an event, an activity, or a material such as a metal-organic framework.
Community detection is a common method in social graph analysis to divide a large graph into subgroups with similar features or properties. Many methods and algorithms exist for community detection in social networks. In this paper, we used the Louvain community detection algorithm to effectively extract communities. In the Louvain method of community detection, small communities are first found by optimizing modularity locally on all nodes; then each small community is grouped into one node and the first step is repeated [39].

In MOFSocialNet, 24 communities were identified through the Louvain method. In Figure 6, we visualize the graphs of the 24 communities. To improve the readability of the graph, we removed the MOF node labels. It should be noted that, using the Gephi software, a further round of community detection was carried out on each of the 24 communities using the Louvain algorithm. The colors of the graph indicate the different communities; thus, nodes of the same color belong to the same community.
One metric to evaluate the extracted communities is modularity. Modularity measures how strongly a network can be divided into different communities. Networks with high modularity have dense connections between the nodes within each community but sparse connections between nodes in different communities [39]. The modularity value is in the range of −0.5 to 1, where a value of 1 indicates the highest modularity. The modularity of MOFSocialNet with 1000 MOFs is 0.748. Table 4 lists the different MOF communities depicted in Figure 6 and indicates the number of MOF nodes in each community and the MOF with the highest centrality in each community.
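To make the modularity score concrete, the following sketch evaluates it for a given partition of a small weighted graph, using the standard definition Q = Σ_c [W_c / m − (S_c / 2m)²], where W_c is the summed weight of edges inside community c, S_c the summed weighted degree of its nodes, and m the total edge weight. This only scores a partition; it is not an implementation of the Louvain algorithm and not the routine used by Gephi. All names and the toy graph are hypothetical.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, weighted graph.

    edges: dict mapping (u, v) pairs to positive weights, each edge listed once.
    community_of: dict mapping every node to its community label.
    """
    total_weight = sum(edges.values())
    if total_weight <= 0:
        raise ValueError("the graph needs at least one edge with positive weight")
    internal = {}  # summed weight of edges inside each community
    strength = {}  # summed weighted degree of the nodes in each community
    for (u, v), weight in edges.items():
        cu, cv = community_of[u], community_of[v]
        strength[cu] = strength.get(cu, 0.0) + weight
        strength[cv] = strength.get(cv, 0.0) + weight
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + weight
    return sum(
        internal.get(c, 0.0) / total_weight - (s / (2.0 * total_weight)) ** 2
        for c, s in strength.items()
    )


# Hypothetical graph: two triangles joined by a single bridging edge.
edges = {
    (0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
    (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,
    (2, 3): 1.0,
}
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "all" for node in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
print(q_split, q_single)
```

Splitting the toy graph into its two triangles yields a clearly positive score, whereas placing all nodes in one community yields zero, which is the behaviour that modularity-optimizing methods such as Louvain exploit.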

Application of MOFSocialNet to Predict the Crystal Density of Unknown MOFs
To evaluate whether properties of new MOFs can be predicted using MOFSocialNet, we randomly chose three MOFs from the CoRE-MOFs database. For these MOFs, we excluded the crystal density as an input during featurization and placed the MOFs within MOFSocialNet. We then predicted the crystal density of the new MOFs by simply averaging the crystal densities of their ten nearest neighbors. The results show outstanding prediction accuracies for the crystal density: 99.69% for MOF ABAYOU, 99.79% for ABIXOZ, and 99.96% for ACOLIP. Table 5 presents the density range of the MOFs in the communities.
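The neighbor-averaging step can be sketched as follows, assuming that the nearest neighbors are the known MOFs with the highest similarity to the new MOF. The function name `predict_by_neighbors`, the neighbor labels, and all numbers are hypothetical; the example uses k = 2 instead of ten only to keep the data small.

```python
def predict_by_neighbors(similarities, known_values, k=10):
    """Average a property over the k most similar known neighbors.

    similarities: dict mapping neighbor name to its similarity to the new MOF.
    known_values: dict mapping neighbor name to the known property value.
    """
    ranked = sorted(similarities, key=similarities.get, reverse=True)[:k]
    if not ranked:
        raise ValueError("at least one neighbor is required")
    return sum(known_values[name] for name in ranked) / len(ranked)


# Hypothetical similarities and crystal densities (not data from the paper).
similarities = {"N1": 0.99, "N2": 0.95, "N3": 0.50}
densities = {"N1": 1.0, "N2": 2.0, "N3": 4.0}
predicted_density = predict_by_neighbors(similarities, densities, k=2)
print(predicted_density)
```

The least similar neighbor "N3" is ignored in this example, so the prediction is the mean of the two closest neighbors.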

Use of MOFSocialNet to Predict Gas Adsorption in the Metal Organic Framework
In the final part of our investigation, we evaluated how well the communities extracted from MOFSocialNet can be exploited for predicting gas adsorption properties of MOFs for CO2 and CH4. To evaluate the performance, we compared the prediction performance of MOFSocialNet with three common ML models, namely K Nearest Neighbour (KNN), Gradient Boosting Regression, and Deep Learning. All ML predictions were performed using RapidMiner machine learning tools. The efficiency of each ML algorithm was assessed by computing the mean absolute error (MAE), which is given in Equation (5):
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - x_i \right| \tag{5}$$
where y_i is the predicted value of gas adsorption, x_i is the true value of the respective gas adsorption, and N is the number of predicted values.
The gas adsorption prediction by MOFSocialNet was performed by (a) creating MOFSocialNet, as explained in the previous section, (b) extracting communities from MOFSocialNet, and (c) predicting gas adsorption in each individual community. The MAE performance parameter for each individual community is presented for the prediction of CO2 adsorption. Moreover, the overall MAE results were compared to the three well-known reference algorithms. Similar to the CO2 prediction, the prediction of gas adsorption for CH4 is presented for each community, and the overall MAE results were compared to the same three algorithms. Using MOFSocialNet significantly improved the prediction compared to the reference ML algorithms, as shown in Figure 7. The findings, presented in Table 6, indicate a significant improvement in the prediction.
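The per-community evaluation can be illustrated with a short sketch of the MAE computation. Only the error metric is shown; how the predictions within each community are produced is not specified here and is left out. The community labels and all values are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """MAE = (1/N) * sum(|y_i - x_i|) over paired predictions and true values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(y - x) for y, x in zip(predicted, actual)) / len(predicted)


# Hypothetical (predicted, true) adsorption values per community.
communities = {
    "community_0": ([1.0, 2.0], [1.5, 2.5]),
    "community_1": ([3.0], [2.0]),
}
mae_per_community = {
    name: mean_absolute_error(predicted, actual)
    for name, (predicted, actual) in communities.items()
}
print(mae_per_community)
```

Reporting the MAE per community, in addition to an overall value, shows whether the prediction quality is uniform across communities or dominated by a few of them.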

Conclusions and Future Direction
In this paper, we demonstrated that social network analysis (SNA), a tool developed in the social sciences, is well suited to analysing MOF structural databases. MOFSocialNet was constructed using geometrical descriptors provided in an existing MOF database, CoRE-MOFs, yielding an undirected, weighted, and heterogeneous social network. We then used MOFSocialNet as a new tool to guide MOF researchers through the vast chemical space of existing and hypothetical MOFs. For demonstration, we employed SNA to identify the most representative MOFs in this set of research data and to detect MOF communities, i.e., families of MOFs with similar properties. Furthermore, within each community, SNA identifies the most representative MOF structure. Our approach can help to rationally select the most appropriate MOF structures for future studies, such as the most promising or most diverse MOFs. To demonstrate the feasibility of property prediction via SNA, we used the communities extracted from MOFSocialNet to predict CO2 and CH4 adsorption and evaluated the accuracy of the prediction. Interestingly, SNA outperformed three common ML models, namely K Nearest Neighbour (KNN), Gradient Boosting Regression, and Deep Learning. The proposed SNA approach can accelerate the analysis of MOF structures, even as the amount of theoretical and experimental data on MOFs increases. As a novel framework, MOFSocialNet can be extended to processing and curating large MOF databases.